Allow s390x to load little endian models unmodified #11234
Conversation
I don't have any big endian systems for testing or maintenance. Do you pledge to help with maintaining big endian support long-term?

Yes, we intend to maintain big endian support long term. We could also help to integrate a Z VM guest into CI as a test system if you'd like.
ggml/src/ggml.c
Outdated
```c
#if defined(__gnu_linux__)
#include <endian.h>
#else
#define le64toh(x) (x)
#define le32toh(x) (x)
#define le16toh(x) (x)
#endif

// endianness conversion
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define convert_from_le16(x) UNUSED(x)
#define convert_from_le32(x) UNUSED(x)
#define convert_from_le64(x) UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
    *((uint16_t*)value) = le16toh(*((uint16_t*)value));
}

static inline void convert_from_le32(void * value) {
    *((uint32_t*)value) = le32toh(*((uint32_t*)value));
}

static inline void convert_from_le64(void * value) {
    *((uint64_t*)value) = le64toh(*((uint64_t*)value));
}
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif
```
This should be defined once in ggml-impl.h. Also check the newly updated contributing guidelines regarding the closing of preprocessor directives.
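A sketch of what that consolidation could look like in ggml-impl.h. The placement and the trailing comments on the closing directives are assumptions based on the review comment; the helper names are the ones from the diff above.

```c
// ggml-impl.h (sketch)

#if defined(__gnu_linux__)
#include <endian.h>
#else
#define le64toh(x) (x)
#define le32toh(x) (x)
#define le16toh(x) (x)
#endif // defined(__gnu_linux__)

// endianness conversion: no-ops on little endian hosts, byteswaps on big endian hosts
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define convert_from_le16(x) UNUSED(x)
#define convert_from_le32(x) UNUSED(x)
#define convert_from_le64(x) UNUSED(x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
static inline void convert_from_le16(void * value) {
    *((uint16_t *) value) = le16toh(*((uint16_t *) value));
}
static inline void convert_from_le32(void * value) {
    *((uint32_t *) value) = le32toh(*((uint32_t *) value));
}
static inline void convert_from_le64(void * value) {
    *((uint64_t *) value) = le64toh(*((uint64_t *) value));
}
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif // __BYTE_ORDER__
```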
```c
static void ggml_byteswap_q4_0(void * restrict buffer, size_t elements) {
    block_q4_0 *data_ptr = (block_q4_0*) buffer;
    for (size_t i = 0; i < elements; ++i) {
        convert_from_le16(&(data_ptr[i].d));
    }
}
```
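For reference, a sketch of the q4_0 block layout this function walks over, following the definition in ggml-common.h (QK4_0 is 32 there and ggml_half is a 16-bit half-precision value): the scale d is the only multi-byte field, while qs packs two 4-bit quants per byte.

```c
#define QK4_0 32
typedef struct {
    ggml_half d;             // delta (scale): 16 bits, byte-order dependent
    uint8_t   qs[QK4_0 / 2]; // nibble-packed quants: single bytes, no swapping needed
} block_q4_0;
```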
Why is only the endianness for the scale d but not the quantized values q being changed? More generally, quantized models require a lot of bit manipulation in order to work. Did you test this code with any quantized models or only with models using conventional, scalar datatypes such as FP16?
I didn't check this function, but ggml_byteswap_q4_k and ggml_byteswap_q6_k are used, and I implemented this function the same way.
I can disable all functions which are not explicitly tested yet, or I could test them if you could point me to models that use them.
More generally speaking, I've thought about the problem of endianness some more and there are quite a lot of potential pitfalls because we do a lot of low-level bitwise manipulations. There's also e.g. the question of how to handle interaction with backends other than CPU which would likely still assume little endianness. Quite frankly I'm not sure whether this is a feature that would be worth the complexity to support given how rare big endian processors are. My personal opinion is that something like this would be fine to merge if it does not interfere with any little endian code and it is clear that big endian issues are wholly your responsibility to determine and resolve. @ggerganov @slaren your input would be appreciated.
I think it would be hard to justify the maintenance cost. I am not sure how this interacts with the existing support for big endian gguf files in the gguf python library and convert_hf_to_gguf.py. It might make more sense to instead improve on that, for example by adding a tool to convert a gguf file to a different endianness. Model repositories like ollama can deal with this by serving both versions on demand, if they wish.
I think you're correct about existing big endian files. I'll add a cmdline switch to disable byteswapping on s390x so that such existing files can still be loaded.

Unfortunately, it's unlikely that everyone will upload two copies of each model, one little endian and one big endian. That's why we'd like a solution for loading little endian models on a big endian system.

Of course, byteswapping in advance would help, and would enable, for example, using mmap. But an additional big-endian-specific step isn't always possible, as in the case of ollama, and it can instead be done while loading the model, as in this PR. While I didn't measure performance, I expect loading a model to take considerably less time than actually running it, and I don't expect byteswapping during loading to change that noticeably.

As for the gguf python library, if there is a testsuite beyond the one described here (https://github.com/ggerganov/llama.cpp/blob/master/ci/README.md), then I could also check it.
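A minimal sketch of the on-the-fly approach described above, assuming the loader has a per-type byteswap callback with the same shape as the PR's ggml_byteswap_* functions; the names maybe_byteswap_tensor and file_is_foreign_endian are illustrative, not the PR's actual API.

```c
#include <stddef.h>
#include <stdbool.h>

// Per-type in-place byteswap, matching the shape of the ggml_byteswap_* helpers above.
typedef void (*ggml_byteswap_t)(void * buffer, size_t elements);

// Hypothetical hook called once per tensor, right after its bytes are read from the file.
// The cost is a single linear pass over the data at load time; inference is untouched.
static void maybe_byteswap_tensor(void * data, size_t n_elements,
                                  ggml_byteswap_t byteswap_fn,
                                  bool file_is_foreign_endian) {
    if (file_is_foreign_endian && byteswap_fn != NULL) {
        byteswap_fn(data, n_elements);
    }
}
```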
If I may chime in on the consideration of support, this PR is a pretty big step forward as umbrella support for zSystems and problematic/unsupported models while we actively work on implementing SIMD support individually (taronaeo/llama.cpp-s390x@891922f). At present, even with a model already converted to big endian, the output is broken:

```
$ build/bin/llama-cli -m /opt/hf_models/llama-3.2-1b-be.F32.gguf -n 50 -t 8 -p "Write me a dog walking business idea 1." --seed 1568795874

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : VXE = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 1568795874
sampler params:
  repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
  dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
  top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
  mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 50, n_keep = 1

Write me a dog walking business idea 1.GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

llama_perf_sampler_print:    sampling time =    3.97 ms /  62 runs   (   0.06 ms per token, 15625.00 tokens per second)
llama_perf_context_print:        load time =  702.51 ms
llama_perf_context_print: prompt eval time =  459.48 ms /  12 tokens (  38.29 ms per token,    26.12 tokens per second)
llama_perf_context_print:        eval time = 9103.29 ms /  49 runs   ( 185.78 ms per token,     5.38 tokens per second)
llama_perf_context_print:       total time = 9602.03 ms /  61 tokens
```

If support to convert little-endian to big-endian on the fly is provided, albeit understandably with performance penalties, we can see models such as Llama-3.2 start running properly using little-endian while we work on a patch to make it work using big-endian.
I've implemented a '--no-byteswap' switch and tested it with a model converted with the help of #11349
Introduce byteswap functions for ggml data. Implement some of them. Currently tested on llama3.2.
Force-pushed 43544fb to a9db9b0
Rebased PR to resolve merge conflicts.
I've added an option to disable byteswapping on s390x. While byteswapping could also be implemented for little-endian systems, I believe most models are little-endian and there's no demand for such an option at the moment. I'd like to proceed with this PR. What's needed for that?
I don't think this can be merged in its current state. As it is, it is just going to be a maintenance burden and will increase the technical debt that we will need to pay when we eventually come back to work on this code and have to figure out a way to deal with this mess.

Fundamentally, the problem is that the gguf loader is not designed to support byteswapping. It just wasn't a consideration when it was created, and it would take a major effort to redesign the system to allow transparent byteswapping during loading.

Additionally, you need to figure out how to deal with big endian GGUF files. Adding a command line flag that just dumps the responsibility of figuring that out onto the user is not a good way to deal with this. I can see two options: remove support for big endian GGUF files entirely, or add the necessary logic to support both types.

Alternatively, you could instead extend the existing support for big endian GGUF files. This would be by far the easiest route, and the most likely to be merged.
I've checked gguf-py, and it determines the byte order based on the version field. I'll remove the cmdline switch and rework the C++ code using the same logic. As for the byteswapping itself, I'll try reworking it.
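A sketch of that heuristic in C: gguf-py treats a version value whose low 16 bits are zero as a file written with the opposite byte order, since valid GGUF versions are small integers. The function name here is illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

// Returns true if the GGUF version field, read with the host's native byte order,
// only makes sense once byteswapped: e.g. version 3 written by the other endianness
// reads as 0x03000000, whose low 16 bits are zero.
static bool gguf_version_indicates_byteswap(uint32_t version_as_read) {
    return version_as_read != 0 && (version_as_read & 0x0000FFFF) == 0;
}
```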
Bitfields are allocated in a different order on s390x.
Add ggml_ prefix to their names.
Now little-endian systems can load big-endian models too. The model endianness heuristic guesses based on the model version field. Additional fixes for the test suite after removing the capability to write non-native-endian files.
Force-pushed a9db9b0 to 3ed07ff
I've reworked the PR a bit.

First of all, I've removed the command line option I introduced earlier. Now there is a heuristic similar to the one in the Python code, which guesses the endianness based on the value of the version field.

I've removed the capability to save gguf files in non-native endianness; files are now automatically saved in native endianness.

I didn't check byteswapping during asynchronous data loading in llama_model_loader::load_all_data, even with CUDA on x86, because I couldn't force it to take this code path: I got upload_backend == nullptr every way I tried. I'm pretty sure it should work, but if you could point me to how to force llama.cpp to run this code path, I could verify that it works too.

Please let me know if there are still any pain points present in this PR.
Are those 3 failures related to the changes in this PR? If yes, what exactly is wrong?
The changes look like a good improvement, however the main problem is still that this is not transparent to the users of the gguf library, and it requires additional code to check and byteswap the tensor data if necessary. llama.cpp is not the only application of ggml. Ideally, this should be completely transparent to the user, but that's not possible without also moving the data loading code (including mmap support) to the gguf library, which would be a major change. When the gguf loader is used without

As it is, other applications that use the gguf library will fail silently when loading a gguf file with a non-native byte order, which is not great. It will also force us to add all the boilerplate for byteswapping to all new code using the gguf library.

I am not saying that this is necessarily a deal-breaker when it comes to the decision of merging this PR, but it would make the decision a lot easier if these issues were addressed. If that's not possible, I will defer to @ggerganov or other maintainers.
I am not sure what the issue is, but I don't think it's caused by the changes in this PR. Merging the latest changes from
If I understand correctly, on

The ideal solution is indeed to handle all this seamlessly in the
What if I add a new field in
I'll disable byteswapping if
As for maintaining the byteswapping parts and/or big endian support, my colleagues and I (ATM https://github.com/Andreas-Krebbel when I'm not available) are willing to support those parts. Feel free to assign any issue arising in this context to me or my colleagues. We could also set up an s390x CI runner to validate any changes on a BE target.

While the IBM Z platform may be a niche platform in terms of the number of systems out there, it powers critical workloads around the world in finance, airline reservations, insurance, retail and many more. Among the biggest players in these industries, the IBM Z platform is seeing good market adoption. The ability to run llama.cpp on IBM Z without too much hassle allows large enterprises to use it in combination with their traditional workloads.
Enable it for llama.cpp
Disabling byteswapping when
My understanding is that BE systems can already run llama.cpp, but it's inconvenient. So BE support with mainline model files would be a convenience gain for only a small number of people. How critical those BE systems are is not relevant. So any implementation must be done in a way that is minimally invasive and does not induce any side effects.
While llama.cpp on BE systems can run BE models, it cannot run LE models. One of the users of llama.cpp is ollama, and it does not work on BE systems because ollama downloads models from an online repository, and those models are always LE. There could be BE models there, but I have not seen any in their repositories yet. And while it is possible to run an external python script to convert a model from LE to BE or the other way around for llama.cpp, for ollama there is no good place where this could be done, as far as I can see at the moment. The only good option is to do it on the fly.

I'm trying to create such an implementation, and I'd be grateful if you'd point me to any such issues remaining in the current implementation.
Do you have any suggestions to improve this code further in order to bring it to an acceptable state?
I am afraid the changes still add too much extra logic that we will have to work around over time, and it does not justify the small added benefit. Currently this code handles the quantization types on a case-by-case basis, which is OK but not great. However, a GGUF file can contain more generic data (such as the unicode strings that you added code to handle in
That's a good point, and I think the better solution would be for providers like ollama to offer big-endian models seamlessly. However, my understanding is that ollama no longer uses
FWIW @AlekseiNikiforovIBM and @ggerganov, RamaLama would probably pick up the slack here; it's one of the reasons RamaLama exists :) If it works with upstream llama.cpp, we can make it work with RamaLama.
Closing this PR as abandoned.
Allow loading little endian models on big endian systems.
This would allow using any models downloaded via ollama unmodified.